DOCUMENT RESUME 



ED 399 775 



FL 024 100 



AUTHOR Keller, Eric; Zellner, Brigitte
TITLE A Timing Model for Fast French.
PUB DATE Mar 96
NOTE 25p.; For complete volume, see FL 024 097.
PUB TYPE Reports - Evaluative/Feasibility (142) -- Journal Articles (080)
JOURNAL CIT York Papers in Linguistics; v17 p53-75 Mar 1996



EDRS PRICE MF01/PC01 Plus Postage.

DESCRIPTORS Foreign Countries; *French; Language Fluency;
*Language Patterns; Language Research; *Language
Rhythm; Linguistic Theory; *Oral Language;
*Phonology; Statistical Analysis; *Syllables



ABSTRACT 

A three-tiered statistical model for predicting the
temporal structure of French, as produced by a single, highly fluent 
subject at a fast speech rate, is outlined. The first tier models 
segmental influences due to phoneme type and contextual interactions 
between phoneme types. The second tier models syllable-level 
influences of lexical versus grammatical status of the containing 
word, presence of schwa, and the position within the word. The third 
tier models utterance-final lengthening. The output of the complete 
model correlates with the original corpus of 1204 syllables at an 
overall r=0.846. However, an examination of subsets of the complete 
data set revealed considerable variation in the closeness of fit of 
the model. Residuals have a normal distribution. Contains 33 
references. (MSE) 



Reproductions supplied by EDRS are the best that can be made
from the original document.









A TIMING MODEL FOR FAST FRENCH* 



Eric Keller and Brigitte Zellner 






University of Lausanne 



1. Introduction

Previous research on the prediction of speech timing has documented 
influences at three major levels: the phoneme or segmental, the syllabic 
and the phrase level. In this paper we describe a three-tiered statistical 
model which has been created for predicting the temporal structure of 
French, as produced by a single, highly fluent speaker at a fast speech 
rate. The first tier models segmental influences due to phoneme type and 
contextual interactions between phoneme types. The second tier models 
syllable-level influences of lexical vs. grammatical status of the 
containing word, presence of schwa and the position within the word. 
The third tier models utterance-final lengthening. The output of the 
complete model correlates with the original corpus of 1204 syllables at 
an overall r = 0.846. However, an examination of subsets of the 
complete data set revealed considerable variation in the closeness of fit 
of the model. Residuals have a normal distribution. 



1.1. Models Based on the Prediction of Segmental
Durations

The most influential statistical model for spoken French text has 
probably been the model proposed by O’Shaughnessy (1981, 1984). On 
the basis of numerous readings of a short text containing all phonemes 
of French, a model of durations of acoustic segments suitable for 
synthesis by rule was proposed. In this model, 33 rules for the 
modification of segment duration according to segment type, segment 
position and phoneme context served to specify basic phoneme
durations.

For sound classes that did not involve prepausal lengthening, the
model was able to predict the durations for 281 segments of a text with 
a standard deviation of 9 ms. But it was less accurate for the prediction 
of prepausal vowel durations, because of the greater variability of 
segments in such positions. Moreover, this model was not able to 
predict silent inter-lexical pauses. 

O’Shaughnessy’s statistical model is constructed around the 
hypothesis that speech timing phenomena can be captured by the 
segment, as if this unit “possesses an inherent target value in terms of 
articulation or acoustic manifestation” (Fujimura 1981). However, 
recent measures have indicated that syllable-sized durations are generally 
less variable than subsyllabic durations, and thus may represent more 
reliable anchor points for the calculation of a general timing structure 
than segmental durations (Barbosa and Bailly 1993; Keller 1993; Zellner 
1994). Taking explicit syllable-level information into account is
further supported by the observation that variations in stress and in
speech rate tend to modify units of at least syllable size.

Bartkova’s model (1985, 1991) attempts to solve these deficiencies
by adding calculated coefficients to the formula for predicting segment
durations:

Dur_Seg = Dur_I + k_Syll + k_Ac

where Dur_I is the intrinsic duration of the segment, k_Syll is a syllabic
coefficient, and k_Ac an accentuation coefficient. The exact manner in
which these coefficients are obtained is not described; it is only noted
that they can vary within an interval from a minimum to a maximum,
according to the position of the segment in the speech chain, and
according to the acoustic properties of the speech sound.

The syllabic coefficient depends on the nature of the word 
(lexical/grammatical), and on the position in the word (initial, medial, 
final syllable). The coefficient of accentuation depends on the next 
consonant, on the presence/absence of a syntactic boundary in the case 
of a final vowel, or on the presence/absence of clusters in the case of a 
final consonant, as well as on the syllabic structure near a pause. 

According to Bartkova, a comparison of predicted and measured
durations in 10 sentences gives rather good predictions, since the mean 
difference on segmental duration is about ±13 ms. 

However, it would seem that beyond the opacity of the coefficients, 
a divergence between predicted and measured durations of the order of 15 
to 30 ms can be a major handicap for short segments. In our corpus, for 
example, the mean duration for /d/ was 50 ms. In the case of such a 
short phoneme, a 15-30 ms divergence would correspond to an error of 
30-60% with respect to its measured duration. 
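
The percentages above follow from simple arithmetic, which the short sketch below reproduces. The function name is an illustrative assumption; only the 50 ms mean duration of /d/ and the 15-30 ms divergence come from the text.

```python
def relative_error_percent(error_ms: float, duration_ms: float) -> float:
    """Express a timing error as a percentage of the measured duration."""
    return 100.0 * error_ms / duration_ms


mean_d_ms = 50.0  # mean duration of /d/ reported for the corpus

for error_ms in (15.0, 30.0):
    share = relative_error_percent(error_ms, mean_d_ms)
    print(f"{error_ms:.0f} ms error on a {mean_d_ms:.0f} ms segment: {share:.0f}%")
```

The same absolute error weighs far less on a long segment, which is why an average error stated in ms can hide large relative errors on short phonemes.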



1.2. Required Macro-timing Information 
Since the segmental unit cannot capture the overall temporal structure 
of speech, the next level which can be expected to encapsulate temporal 
phenomena is the syllable. This appears to be a good candidate. 
According to some psycholinguists, it is considered to be the minimal 
perception unit, and according to a number of phoneticians and 
phonologists, it is the minimal unit of rhythm (see Delais 1994). 

It has been shown that quite a number of parameters are involved in 
variations of syllabic duration. The most important are: the position in 
the prosodic group, the position in the word, degree of stress, the length 
of the prosodic group, the position according to the stressed syllable, 
the position according to the local speech rate (as measured by cycles of 
speeding up and slowing down), semantic focus, proximity of syntactic 
boundaries, the status of the word (lexical or grammatical), and 
emotional factors (Bartkova 1985, 1992; Campbell 1992; Delais 1994;
Duez 1985, 1987; Fant et al. 1991; Fónagy 1992; Grégoire 1899;
Grosjean et al. 1975, 1983; Guaitella 1992; Konopczynski 1986;
Martin 1987; Mertens 1987; Monnin et al. 1993; Pasdeloup 1988,
1990, 1992; Wenk et al. 1982; Wunderli 1987). Some of these factors
may be redundant; for instance, in many cases of read text, lexeme-final 
position may be redundant with phrase-final position. 

In view of existing information, it thus seems best to begin with 
segmental predictions, and to consider syllabic information as additional 
information which is not captured at the segmental level. One of the 
important points to consider in the present study will be the selection of 
non-redundant and relevant information. 

Beyond the syllabic level, it is likely that a good predictive model
will eventually need to incorporate further information at the word or
the phrase level. For example, the prediction of pauses for slow speech
requires phrasal knowledge, which is not captured at the segmental or at
the syllabic level. In the area of word group boundaries in French
speech, a great deal of work has been accomplished to determine the
nature of these groups (syntactic groups, prosodic groups, rhythmic
groups, intonational groups, and the congruence between these labels)
and to calculate the automatic generation of such groups and potential
inter-group pauses (Delais 1994; Grosjean et al. 1975; Keller et al.
1993; Martin 1987; Monnin et al. 1993; Pasdeloup 1988; Saint-Bonnet
et al. 1977). These effects will have to be integrated into a general
timing model for a given language, but were not taken into account in
the present study.

In the current study, the objective was to account for a single 
speaker’s syllable durations with the smallest number of segmental and 
syllabic factors. At each succeeding level, relevant parameters were 
chosen so as to explain the greatest proportion of the variance in the 
residue of the previous analysis. In this manner, a three-tier model, 
based successively on segmental, syllabic and phrasal information, was 
constructed. 



2. Method
2.1. The corpus 

A highly fluent speaker of French (a professor of French literature) was 
recorded with 277 sentences, the first 100 of which were analysed for the 
present study. The speaker was instructed to speak quite rapidly, with a 
normal, unexaggerated intonation. The resulting readings have generally 
been judged by listeners as highly intelligible and well-pronounced. No
dialectal particularities were noted. 

Recording occurred in studio conditions on DAT tape. The digitized
data were transferred to a Macintosh computer and downsampled to 16
kHz.






2.2. Time labelling 

The time occupied by each phoneme was labelled with the Signalyze™
program according to detailed instructions on how to handle phoneme-
to-phoneme transitions (Thévoz and Enkerli 1994). Specifically,
transitions in the acoustic corpus were analyzed according to three
articulatory levels: labial, lingual and laryngeal. For example, the
coarticulatory overlap at the /e/-/s/ transition was marked by symbols
representing the following events: “onset of friction, associated with the
lingual level”, followed at a given time interval by an “offset of
fundamental frequency, associated with a cessation of vocal cord
activity”. The following possible states were distinguished:

Labial system: aperture, occlusion, friction, burst, error 

Lingual system: aperture, occlusion, friction, burst, palatal, 
transient movement, error 

Laryngeal system: aperture, occlusion, transient movement, 
diminution, error 

“Error” refers to any state that occurs inadvertently, such as during a 
speech error. 

To examine the reliability of transcriptions, two judges compared
judgements concerning how and where points of transition between
inferred articulatory states were to be marked. Two measures of
interjudgemental agreement were used:

Robustness (agreement in the application of criteria to state 
transition), scored 1 = low agreement, 2 = agreement in general, but 
some further discussion required, and 3 = excellent agreement. 

Precision, scored 1 = more than two F0 periods difference, 2 = 1-2
F0 periods difference and 3 = less than 1 F0 period difference in
measurement.

Both measures showed good to excellent interjudgemental 
agreement. Over the 50 types of state transitions examined, there were 
no cases of low robustness or low precision. The average robustness 
was 2.53 and the average precision was 2.68. 

A total of 4544 phonemes and 1203 syllables were analyzed in this 
manner. 


3. Analysis and Results

A modified step-wise statistical regression technique was used to
develop a well-fitting model of this speaker’s timing behaviour. In
accordance with previous observations on factors that influence speech
timing, it was decided to model three major levels: the segmental, the 
syllabic and the phrase level. In step-wise fashion, each succeeding level 
was made to model the residue left by the previous level. Three different 
models were thus established, the Segmental, the Syllabic and the 
Phrase Model (Figure 1). 



[Figure 1: diagram of the Segmental Model, the Syllabic Model and the Phrase Model]

Figure 1. The Segmental, Syllabic and Phrase Models. Each subsequent
model incorporates the modelling effects of the previous level.
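
The step-wise logic, in which each level models the residue left by the previous one, can be sketched as follows. This is only an illustration under strong assumptions: each tier is reduced to a single factor fitted by group means, whereas the study used general linear models with interactions, and the factor levels and durations in the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_group_means(keys, targets):
    """Simplest factorial model: one mean per factor level."""
    groups = defaultdict(list)
    for key, target in zip(keys, targets):
        groups[key].append(target)
    return {key: mean(values) for key, values in groups.items()}


def stepwise_fit(rows, n_tiers=3):
    """Fit each tier to the residue left by the previous tier.

    rows: tuples whose first n_tiers items are factor levels and whose
    last item is the observed (transformed) duration.
    Returns the final predictions and the residual sum of squares per tier.
    """
    observed = [row[-1] for row in rows]
    residue = list(observed)
    predicted = [0.0] * len(rows)
    sse_per_tier = []
    for tier in range(n_tiers):
        keys = [row[tier] for row in rows]
        model = fit_group_means(keys, residue)
        predicted = [p + model[k] for p, k in zip(predicted, keys)]
        residue = [o - p for o, p in zip(observed, predicted)]
        sse_per_tier.append(sum(d * d for d in residue))
    return predicted, sse_per_tier


# Hypothetical toy data: (segmental, syllabic, phrase level, sqrt duration)
toy_rows = [
    ("CV", "lexical", "final", 16.0),
    ("CV", "function", "medial", 10.0),
    ("CVC", "lexical", "medial", 14.0),
    ("CVC", "function", "medial", 12.0),
    ("CV", "lexical", "medial", 12.0),
    ("CVC", "lexical", "final", 18.0),
]
predicted, sse_per_tier = stepwise_fit(toy_rows)
print(sse_per_tier)
```

Because every tier is fitted only to what the earlier tiers left unexplained, the residual sum of squares cannot increase from one tier to the next.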



3.1. Model 1: The Segmental Model 

Segmental Durations and Overlap Zones. An initial issue concerned the 
calculation of segmental duration in a corpus where coarticulatory 
transition zones are marked explicitly. Does phoneme duration 
correspond to the zone of the signal which is unambiguously marked for 
a given phoneme (zone B in figure 2), or does it include one or both 
zones of coarticulatory overlap with adjoining phonemes (zones A and 
C in figure 2)? 

[Figure 2: schematic of a phoneme with an “unambiguous” zone B flanked by the overlap zones A (overlap 1) and C (overlap 2)]

Figure 2. What constitutes a phoneme? B is a portion of the signal that
is unambiguously marked for the phoneme /r/, while A and C are
transitory zones with adjoining phonemes.

The issue was resolved with reference to durational variation. The 
combination of zones A, B and C (with an average coefficient of 
variation of 0.375) turned out to be systematically less variable than the 
unambiguous zone B (with an average coefficient of variation of 0.412) 
(see Table 1). 





Zone(s)        Average coefficient of variation
               (s.d./mean) for 34 phonemes

A              1.6379
B              0.4123
C              1.7472
A + B          0.3916
B + C          0.3933
A + B + C      0.3751

Table 1. Coefficients of variation for zones A, B and C as well as
various combinations of these zones
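
As an illustration of the comparison in Table 1, the coefficient of variation (s.d./mean) can be computed per zone as below. The token durations are hypothetical, the sketch covers a single phoneme rather than the average over 34 phonemes, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def coefficient_of_variation(durations_ms):
    """Standard deviation divided by the mean (sample s.d. assumed)."""
    return stdev(durations_ms) / mean(durations_ms)


# Hypothetical tokens of one phoneme: (zone A, zone B, zone C) in ms
tokens = [(5, 60, 30), (25, 50, 10), (10, 70, 5), (30, 40, 20)]

zone_b = [b for _, b, _ in tokens]
zone_abc = [a + b + c for a, b, c in tokens]

cv_b = coefficient_of_variation(zone_b)
cv_abc = coefficient_of_variation(zone_abc)
print(f"CV of B alone: {cv_b:.3f}, CV of A + B + C: {cv_abc:.3f}")
```

In this toy case the transition zones absorb part of the variation in B, so the combined duration is more regular, which is the pattern the table reports for the real corpus.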



Also, combinations of zones A and B, or of B and C, were less variable
than zone B alone. The transition zones can thus be considered to be
“buffer zones” whose function, in part, may well be to “regularise”
phoneme duration. For the purpose of the present research it was thus
decided to consider the combined duration of A, B and C as “phoneme
duration”. Syllable durations were constructed from phoneme durations
by taking into account transitional overlaps. As a net effect, the
segmental duration entering the statistical modelling procedure is
slightly more regular than more commonly measured phoneme
durations. Nevertheless, it is not believed that the modelling results of
the present study seriously depend on this manner of proceeding; the
size and resilience of the measured effects suggest that as long as
transitions are handled in systematic fashion, the predictive pattern
should remain largely identical.



3.2 Segmental transformation and grouping.

Raw segment durations were non-normal in their distribution. Among
the common transformations, the log10 transformation produced the
closest approximation to a normal distribution (Figure 3a, b). All
calculations of the segmental portion of the model were thus performed
on log10-transformed durations.
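
The effect of the transformation can be illustrated with a small sketch that compares the skewness of right-skewed durations before and after taking log10. The durations are hypothetical, and skewness is used here only as a convenient numerical stand-in for the histograms and normal probability plots that the study inspected.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: mean cubed z-score (0 for a symmetric sample)."""
    m = mean(values)
    s = pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)


# Hypothetical segment durations in ms, with a long right tail
durations_ms = [30, 40, 45, 50, 55, 60, 70, 80, 100, 130, 180, 260]
log_durations = [math.log10(d) for d in durations_ms]

raw_skew = skewness(durations_ms)
log_skew = skewness(log_durations)
print(f"skewness raw: {raw_skew:.2f}, after log10: {log_skew:.2f}")
```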





[Figure 3a: histograms of segment durations; axis label: log10 (ms)]

Figure 3a. The distribution of segment durations before and after the
log10 transformation: histograms.

[Figure 3b: normal probability plots of segment durations; axis label: nscores]

Figure 3b. The distribution of segment durations before and after the
log10 transformation: normal probability plots.

Subsequent to transformation, phonemes were grouped according to 
their mean durations and their articulatory definitions. Eight classes 
could be identified (Table 2). Groups showed roughly comparable 
coefficients of variation, and an inspection of histograms and normal 
probability plots showed roughly normal distributions for all classes 
whose N was greater than 100. 



Phoneme type                   Name           Mean duration   Coefficient of variation   Frequency
                                              (ms)            (s.d./mean)                (N)

œ, ø                           AntRound       109.45          0.4881                       71
ʃ, s, f                        Fric           105.17          0.2708                      357
œ̃, ɛ̃, ɑ̃, ɔ̃                     Nas             97.78          0.3585                      334
o                              PostMidRnd      94.92          0.3130                       60
p, t, k                        UnvPlos         92.94          0.3475                      504
a, e, ɛ, ɔ, u, i, y            OthVow          69.62          0.4089                     1557
b, z, m, ?, g, v, ʒ, n, d, ?   VcdCons         61.72          0.3669                      892
R, j, w, l, ɥ                  SemiVLiquids    43.63          0.4908                      769
Mean                                           90.23          0.3648                      539

Table 2. Mean durations for phoneme classes (N = 4544)



To test Model 1 in the syllabic context, square root-transformed syllable
durations were calculated on the basis of coefficients produced by the
linear model for segmental durations, and by taking into account mean
durations of phoneme-to-phoneme transitions. These calculated syllable
durations were compared to the square root-transformed measured
syllable durations. The correlation coefficient was r = .647 (N = 1203,
p < .0001) (Figure 5).

[Figure 5: plot of Model 1 predictions; axis label: Model 1]

Figure 5. Prediction of the Segmental Model (Model 1): Syllable
durations predicted exclusively on the basis of segmental durations (r =
.647). Values are in sqrt(ms).

The residue from the model (= observed - predicted) was termed “Delta
1” and served as the basis for further factorial modelling at the syllabic
level.
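
One possible reading of this step is sketched below. It assumes that each predicted phoneme duration covers zones A + B + C, so that the overlap shared by two neighbouring phonemes is counted twice and its mean duration has to be subtracted once. The text does not spell out the exact procedure, so the function, its parameters and the example values are assumptions.

```python
import math


def syllable_sqrt_duration(predicted_log10_ms, mean_overlaps_ms):
    """Combine predicted phoneme durations into a sqrt(ms) syllable value.

    predicted_log10_ms: segmental model outputs, one per phoneme, in log10(ms).
    mean_overlaps_ms: mean transition duration between each adjacent pair,
    subtracted once because both neighbours include it (assumption).
    """
    total_ms = sum(10 ** value for value in predicted_log10_ms)
    total_ms -= sum(mean_overlaps_ms)
    return math.sqrt(total_ms)


def delta(observed_sqrt, predicted_sqrt):
    """Residue as defined in the text: observed minus predicted."""
    return observed_sqrt - predicted_sqrt


# Hypothetical two-phoneme syllable: 100 ms + 69 ms with a 25 ms overlap
predicted = syllable_sqrt_duration([2.0, math.log10(69.0)], [25.0])
delta_1 = delta(13.0, predicted)
print(f"predicted: {predicted:.2f} sqrt(ms), Delta 1: {delta_1:.2f}")
```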



3.3 A Linear Model for Segmental Durations. 

Using the Data Desk® statistical package on the Macintosh, a general
linear model for discontinuous data (based on an ANOVA) was
calculated with partial (non-sequential, Type 3) sums of squares. The
following main and interaction factors (up to two-way [1]) were
postulated:

duration (log10(ms)) = constant + previous type + current type + next
type + previous type * current type + current type * next type +
previous type * next type



[1] For reasons of insufficiency in per-cell observations, calculation
complexity and theoretical difficulty of interpretation, three-way
interactions were not calculated.
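
To make the three factors concrete, the sketch below derives the (previous type, current type, next type) triple for each phoneme in a sequence of class labels. The extra "none" level for a missing neighbour is an assumption, suggested by the fact that the previous and next factors have one more degree of freedom than the current factor in the analysis of variance; the study does not state how utterance edges were coded.

```python
BOUNDARY = "none"  # assumed extra level for a missing neighbour


def context_triples(classes):
    """Return (previous, current, next) class labels for each phoneme."""
    padded = [BOUNDARY] + list(classes) + [BOUNDARY]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


# Hypothetical sequence of phoneme classes, using the class names of Table 2
utterance = ["UnvPlos", "OthVow", "SemiVLiquids", "Nas"]
triples = context_triples(utterance)
for triple in triples:
    print(triple)
```

Each triple supplies the factor levels for one observation; the two-way interaction terms are then products of pairs of these factors.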

Table 3. The Segmental Analysis of Variance for Segmental
Data (N = 4544) Using Partial Sums of Squares

Source               df     Sums of Squares   Mean Square   F-ratio    Prob

Const                   1   14903.8           14903.8       642500     < 0.0001
previous                8   0.123239          0.015405      0.66410    0.7236
current                 7   3.13402           0.447717      19.301     < 0.0001
next                    8   0.267002          0.033375      1.4388     0.1748
previous * current     50   3.24144           0.064829      2.7948     < 0.0001
current * next         50   5.04499           0.100900      4.3498     < 0.0001
previous * next        60   1.79531           0.029922      1.2899     0.0665
Error                4360   101.137           0.023197
Total                4543   196.070


In the partial sums of squares solution, all factors were significant at 
p<.05, with the exception of “previous type” and “next type”, taken 
alone, and the interaction term “previous type * next type” (Table 3). 
The residual error was 101.137/196.070 = 0.516, that is, the model 
explained about 48.4% of the variance. Expressed in terms of a Pearson 
product-moment correlation, the model’s predicted segmental durations 
correlated with empirical phoneme durations at r = 0.696. 
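
The figures quoted in this paragraph can be recomputed from the sums of squares in Table 3, as in the sketch below. The function names are illustrative; the numerical inputs are taken from the table.

```python
import math


def explained_variance(ss_error: float, ss_total: float) -> float:
    """Proportion of variance explained: 1 minus the residual share."""
    return 1.0 - ss_error / ss_total


def f_ratio(mean_square: float, mean_square_error: float) -> float:
    """F-ratio of a factor: its mean square over the error mean square."""
    return mean_square / mean_square_error


ss_error, ss_total = 101.137, 196.070          # Table 3, Error and Total
residual_share = ss_error / ss_total
explained = explained_variance(ss_error, ss_total)
r = math.sqrt(explained)                        # multiple correlation
f_current = f_ratio(0.447717, 0.023197)         # factor "current"

print(f"residual share: {residual_share:.3f}")
print(f"explained: {explained:.3f}, r: {r:.3f}")
print(f"F for current type: {f_current:.2f}")
```

The correlation of 0.696 is thus the square root of the explained share of variance, not an independent measurement.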


3.4 Syllable Durations and Delta 1.
Another means of testing the model is a comparison with measured 
syllable durations. In contrast to phoneme durations, where a log 
transformation served to provide roughly normal distributions, square 
roots had to be applied to measured syllable durations in order to 
approximate normal distributions (Figure 4). 




[Figure 4: normal probability plot of syllable durations; axis labels: sqrtMeas, nscores]

Figure 4. Syllable durations in ms were square-root transformed in order
to approximate a normal distribution.



3.4. Model 2: The Syllabic Model 

Syllabic Factors Predicting Delta 1. After considerable experimentation
with a variety of factors described in the literature, a three-factor model,
including two-way interactions, was retained for analysis:

delta 1 = constant + function + position + schwa + function * position
+ function * schwa + position * schwa,

where “function” distinguishes whether the syllable is found in a lexical
or a function word, “position” identifies three types of position in the
word which are (1) “monosyllabic and polysyllabic-initial”, (2)
“polysyllabic pre-schwa” and (3) “other”, and “schwa” indicates whether
or not a schwa is present in the syllable. Again, a general linear model
for discontinuous data was calculated with partial (Type 3) sums of
squares. The results of the ANOVA showed that all main and interaction
factors were significant at p < .05 (Table 4). The residual error of
3277.29/5432.93 = .6 indicated that the model explained 40% of the
variance in Delta 1.

Table 4. Analysis of Variance for Delta 1 (N = 1202)
Using Partial Sums of Squares

Source                df     Sums of Squares   Mean Square   F-ratio   Prob

Const                    1   2663.53           2663.53       969.58    < 0.0001
function                 1   176.508           176.508       64.252    < 0.0001
position                 2   98.5753           49.2877       17.942    < 0.0001
schwa                    1   149.296           149.296       54.347    < 0.0001
function * position      2   97.3872           48.6936       17.725    < 0.0001
function * schwa         1   27.5860           27.5860       10.042    0.0016
position * schwa         2   63.0467           31.5234       11.475    < 0.0001
Error                 1193   3277.29           2.74710
Total                 1202   5432.93


Model 2 and Delta 2. Syllable durations obtained from the segmental
model were combined with those from the present linear model for Delta
1 to produce the Syllabic Model (Model 2). The predictions correlated
with observed square root-transformed syllable durations at r = .723
(N = 1203) (Figure 6). The residual data was termed Delta 2.

Figure 6. Prediction of the Syllabic Model (Model 2): Syllable
durations predicted on the basis of segmental durations and syllable-level
factors (r = .723). Values are in sqrt(ms).



3.5. Model 3: The Phrase Model 

Inspection of the predictions of Models 1 and 2 (Figures 5 and 6) 
showed a noticeable deviation from the regression line in the higher 
values. Specifically, these models underestimated most syllable 
durations in the > 280 ms range. Furthermore, an examination of Delta 
2 revealed that the residual error was most pronounced for utterance-final 
syllables ending in a consonant. Consequently, a correction term was 
calculated, which was applied to such syllables in Model 3. 

The predictions of Model 3, which incorporates segmental and 
syllabic modelling as well as the phrase-final correction term, correlated 
with the observed square root-transformed syllable durations at r = .846 
(Figure 7). The residual values from Model 3 vary quasi-randomly 
around 0. At the present time, it appears that only more sophisticated 
rules for the generation of the schwa vowel may still be able to improve 
this model’s predictive capacity to some degree. 
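
A correction term of this kind could be obtained as sketched below. The text does not say how the term was calculated, so taking the mean of Delta 2 over the utterance-final, consonant-final syllables is an assumption, and the function names and toy values are hypothetical.

```python
from statistics import mean


def final_correction(delta_2, is_final_closed):
    """Mean Delta 2 over utterance-final syllables ending in a consonant."""
    selected = [d for d, flag in zip(delta_2, is_final_closed) if flag]
    return mean(selected) if selected else 0.0


def apply_model_3(model_2_pred, is_final_closed, correction):
    """Add the correction only to the flagged syllables."""
    return [
        p + correction if flag else p
        for p, flag in zip(model_2_pred, is_final_closed)
    ]


# Hypothetical values in sqrt(ms)
model_2_pred = [12.0, 13.0, 14.0, 15.0]
observed = [12.5, 12.5, 17.0, 19.0]
is_final_closed = [False, False, True, True]

delta_2 = [o - p for o, p in zip(observed, model_2_pred)]
correction = final_correction(delta_2, is_final_closed)
model_3_pred = apply_model_3(model_2_pred, is_final_closed, correction)
print(correction, model_3_pred)
```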

[Figure 7: plot of Model 3 predictions; axis label: Model 3]

Figure 7. Prediction of the Phrase Model (Model 3): Syllable durations
predicted on the basis of segmental durations, syllable-level factors and
phrase-final lengthening (r = .846). Values are in sqrt(ms).



3.5.1. Stability

The Phrase Model was examined for its predictive stability by 
performing Pearson product-moment correlations between various 
subsamples of the data and the model’s prediction. The resulting data is 
presented in Table 5. 

Table 5. Pearson Product-Moment Correlations between Various
Subsets of the Dataset and the Phrase Model's Prediction

              slices of 50   slices of 100   slices of 200   slices of 300
              syllables      syllables       syllables       syllables

1st slice     0.9            0.884           0.878           0.869
2nd slice     0.87           0.872           0.789           0.805
3rd slice     0.853          0.852           0.838           0.874
4th slice     0.89           0.726           0.885           0.838
5th slice     0.866          0.823           0.841
6th slice     0.852          0.868           0.838

It can be seen that the model’s predictive capacity varies considerably 
from one subset to the next. For example, the correlation was only .726 
for the fourth slice of 100 syllables in the set, while it had been .884 
for the first slice. Even when slices of 300 syllables are compared, 
considerable variability prevails. The reasons for these instabilities are 
presently being investigated. 
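
A stability check of this kind can be sketched as follows: the data are cut into consecutive slices of a fixed size and a Pearson product-moment correlation is computed within each slice. The slicing into non-overlapping consecutive blocks is an assumption about how the subsets were formed, and the observed and predicted values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def slice_correlations(observed, predicted, size):
    """Correlation within each full, consecutive slice of `size` items."""
    return [
        pearson_r(observed[i:i + size], predicted[i:i + size])
        for i in range(0, len(observed) - size + 1, size)
    ]


# Hypothetical sqrt(ms) values for ten syllables
observed = [10.0, 12.0, 11.0, 15.0, 9.0, 14.0, 13.0, 16.0, 12.0, 11.0]
predicted = [10.5, 11.5, 11.5, 14.0, 10.0, 13.0, 13.5, 15.0, 12.5, 11.0]

per_slice = slice_correlations(observed, predicted, size=4)
print([round(r, 3) for r in per_slice])
```

Incomplete trailing slices are dropped, which matches the table in that 1203 syllables yield four slices of 300 and six slices of 200.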



4. Discussion 

By a modified step-wise procedure, a general model for the prediction of 
the fast-speech performance of a highly fluent speaker of French was 
constructed. The initial model incorporates segmental information 
concerning type of phoneme and proximal phonemic context. The 
subsequent model adds information about whether the syllable occurs in 
a function or a lexical word, on whether the syllable contains a schwa 
and on where in the word the syllable is located. The final model adds 
information on phrase-final lengthening. The effects of these three 
levels are demonstrated on a single sentence in Figure 8. In view of 
current discussions surrounding segmental and syllabic contributions to 
timing models, it is interesting to note that segmental information 
accounts for a major portion of the variance explained by the model. As 
Figure 8 shows, segmental information alone successfully predicts 
several cases of major syllable lengthening. 

Figure 8. A comparison of predictions of the three models and measured
syllable durations for the sentence “Son étude ethnologique porte sur la
relation entre les acupuncteurs et les centenaires afghans”.

The overall correlation of 0.846 between predictions of Model 3 and the 
data set from which the model is derived is encouraging. This 
correlation level corresponds roughly to the average inter-speaker
correlation of r = 0.833 for phrase-final syllable durations, as measured 
between the readings of a short text by 12 speakers in the Caelen- 
Haumont corpus (Caelen-Haumont 1991; see Keller 1994). This means 
that the model behaves as differently from its target data as one natural 
speaker would behave with respect to another speaker. Although this 
may be an acceptable initial predictive level for synthesis purposes, 
further improvements in the modelling would be welcome. Preliminary 
indications suggest that such improvements may come about through 
predictions of the presence vs. the absence of schwa, through explicit 
predictions of the effects of speech rate manipulation, and in longer 
texts, through a better modelling of pauses. Further information on 
possible improvements may also be gained through an examination of 
cases of high delta 3 values in subsets of the present data set. These 
effects are currently being studied. 

It is worth noting that in the present fast-speech corpus, no phrase- 
level effects were identified, other than phrase-final lengthening. This is 
in contrast to our findings on the production of French at a normal 
speech rate, where a fairly systematic increase of lexeme-final syllable 
durations was observed over the extent of the prosodic phrase (Keller et
al. 1993). It seems likely that in conditions of considerably accelerated
speech rate, our speaker sacrificed some of the “niceties” of phrase- 
internal timing modulation, and limited himself to a single, phrase-final 
durational marker. 

Considerably more work also needs to be done before the 
generalisability of the present model can be tested. The examination of 
the model’s stability has shown that predictions begin to show 
comparable strength at about 300 syllables or more. Consequently, 
systematic testing of these predictions for another speaker would 
involve a completely new research study. Nevertheless, a few quick 
examinations of predictions for another speaker’s sentences suggest that 
the model may indeed be generalisable to more than one speaker of 
French (Figure 9)^. 

^ The authors are grateful to the following members of the LAIP team for
their invaluable assistance in scoring and creating the present corpus:
Nicolas Thévoz, Alexandre Enkerli, Hervé Mesot, Cédric Bourquart, Nicole
Blanchoud, and Thomas Styger. Particular thanks go to Prof. J. Local (York

Figure 9. A comparison of predictions of Model 3 and the measured
syllable durations of another speaker of French for the fast reading of the
sentence “Beaucoup de gouvernements voient le CERN comme un
moteur de modernisation technologique”.